Abstract

Visual saliency is a fundamental problem in both cognitive and computational sciences,
including computer vision. In this paper, we discover that a high-quality visual saliency
model can be learned from multiscale features extracted using deep convolutional neural
networks (CNNs), which have had many successes in visual recognition tasks. For learning
such saliency models, we introduce a neural network architecture, which has fully connected
layers on top of CNNs responsible for feature extraction at three different scales. We then
propose a refinement method to enhance the spatial coherence of our saliency results.
Finally, aggregating multiple saliency maps computed for different levels of image
segmentation can further boost the performance, yielding saliency maps better than those
generated from a single segmentation. To promote further research and evaluation of visual
saliency models, we also construct a new large database of 4447 challenging images and
their pixelwise saliency annotations. Experimental results demonstrate that our proposed
method is capable of achieving state-of-the-art performance on all public benchmarks,
improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new
dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on
these two datasets.

1. Introduction 

Visual saliency attempts to determine the amount of attention steered towards various
regions in an image by the human visual and cognitive systems [6]. It is thus a fundamental
problem in psychology, neural science, and computer vision. Computer vision researchers
focus on developing computational models for either simulating the human visual attention
process or predicting visual saliency results. Visual saliency has been incorporated in a
variety of computer vision and image processing tasks to improve their performance. Such
tasks include image cropping [31], retargeting [4], and summarization [34]. Recently,
visual saliency has also been increasingly used by visual recognition tasks [32], such as
image classification [36] and person re-identification [3' ].

Human visual and cognitive systems involved in the visual attention process are composed
of layers of interconnected neurons. For example, the human visual system has layers of
simple and complex cells whose activations are determined by the magnitude of input signals
falling into their receptive fields. Since deep artificial neural networks were originally
inspired by biological neural networks, it is a natural choice to build a computational
model of visual saliency using deep artificial neural networks. Specifically, recently
popular convolutional neural networks (CNNs) are particularly well suited for this task
because convolutional layers in a CNN resemble simple and complex cells in the human visual
system [14] while fully connected layers in a CNN resemble higher-level inference and
decision making in the human cognitive system.

In this paper, we develop a new computational model for visual saliency using multiscale
deep features computed by convolutional neural networks. Deep neural networks, such as
CNNs, have recently achieved many successes in visual recognition tasks [24, 12, 15, 17].
Such deep networks are capable of extracting feature hierarchies from raw pixels
automatically. Further, features extracted using such networks are highly versatile and
often more effective than traditional handcrafted features. Inspired by this, we perform
feature extraction using a CNN originally trained over the ImageNet dataset [10]. Since
ImageNet contains images of a large number of object categories, our features contain rich
semantic information, which is useful for visual saliency because humans pay varying
degrees of attention to objects from different semantic categories. For example, viewers
of an image likely pay more attention to objects like cars than the sky or grass. In the
rest of this paper, we call such features CNN features.

By definition, saliency results from visual contrast, as it intuitively characterizes
certain parts of an image that appear to stand out relative to their neighboring regions
or the rest of the image. Thus, to compute the saliency of an image region, our model
should be able to evaluate the contrast between the considered region and its surrounding
area as well as the rest of the image. Therefore, we extract multiscale CNN features for
every image region from three nested and increasingly larger rectangular windows, which
respectively enclose the considered region, its immediate neighboring regions, and the
entire image.

On top of the multiscale CNN features, our method further trains fully connected neural
network layers. Concatenated multiscale CNN features are fed into these layers trained
using a collection of labeled saliency maps. Thus, these fully connected layers play the
role of a regressor that is capable of inferring the saliency score of every image region
from the multiscale CNN features extracted from nested windows surrounding the image
region. It is well known that deep neural networks with at least one fully connected layer
can be trained to achieve a very high level of regression accuracy.

We have extensively evaluated our CNN-based visual saliency model over existing datasets,
and in the process noticed a lack of large and challenging datasets for training and
testing saliency models. At present, the only large dataset that can be used for training
a deep neural network based model was derived from the MSRA-B dataset [26]. This dataset
has become less challenging over the years because images there typically include a single
salient object located away from the image boundary. To facilitate research and evaluation
of advanced saliency models, we have created a large dataset where an image likely contains
multiple salient objects, which have a more general spatial distribution in the image. Our
proposed saliency model has significantly outperformed all existing saliency models over
this new dataset as well as all existing datasets.

In summary, this paper has the following contributions: 

• A new visual saliency model is proposed to incorporate multiscale CNN features
extracted from nested windows with a deep neural network with multiple fully connected
layers. The deep neural network for saliency estimation is trained using regions from a
set of labeled saliency maps.

• A complete saliency framework is developed by further integrating our CNN-based
saliency model with a spatial coherence model and multi-level image segmentations.

• A new challenging dataset, HKU-IS, is created for 
saliency model research and evaluation. This dataset 
is publicly available. Our proposed saliency model has 
been successfully validated on this new dataset as well 
as on all existing datasets. 

1.1. Related Work 

Visual saliency computation can be categorized into bottom-up and top-down methods or a
hybrid of the two. Bottom-up models are primarily based on a center-surround scheme,
computing a master saliency map by a linear or non-linear combination of low-level visual
attributes such as color, intensity, texture and orientation [19, 18, 1, 8, 26]. Top-down
methods generally require the incorporation of high-level knowledge, such as objectness
and face detectors, in the computation process [20, 7, 16, 33, 25].

Recently, much effort has been made to design discriminative features and saliency priors.
Most methods essentially follow the region contrast framework, aiming to design features
that better characterize the distinctiveness of an image region with respect to its
surrounding area. In [26], three novel features are integrated with a conditional random
field. A model based on low-rank matrix recovery is presented in [3: ] to integrate
low-level visual features with higher-level priors.

Saliency priors, such as the center prior [26, 35, 23] and the boundary prior [22, 40],
are widely used to heuristically combine low-level cues and improve saliency estimation.
These saliency priors are either directly combined with other saliency cues as weights
[8, 9, 2< ] or used as features in learning based algorithms [22, 23, 25]. While these
empirical priors can improve saliency results for many images, they can fail when a
salient object is off-center or significantly overlaps with the image boundary. Note that
object location cues and boundary-based background modeling are not neglected in our
framework, but have been implicitly incorporated into our model through multiscale CNN
feature extraction and neural network training.

Convolutional neural networks have recently achieved many successes in visual recognition
tasks, including image classification [24], object detection [15], and scene parsing [12].
Donahue et al. [11] pointed out that features extracted from Krizhevsky's CNN trained on
the ImageNet dataset [10] can be repurposed to generic tasks. Razavian et al. [30] extended
their results and concluded that deep learning with CNNs can be a strong candidate for any
visual recognition task. Nevertheless, CNN features have not yet been explored in visual
saliency research primarily because saliency cannot be solved using the same framework
considered in [11, 30]. It is the contrast against the surrounding area rather than the
content inside an image region that should be learned for saliency prediction. This paper
proposes a simple but very effective neural network architecture to make deep CNN features
applicable to saliency modeling and salient object detection.

2. Saliency Inference with Deep Features 

Figure 1: The architecture of our deep feature based visual saliency model.

As shown in Fig. 1, the architecture of our deep feature based model for visual saliency
consists of one output layer and two fully connected hidden layers on top of three deep
convolutional neural networks. Our saliency model requires an input image to be decomposed
into a set of nonoverlapping regions, each of which has almost uniform saliency values
internally. The three deep CNNs are responsible for multiscale feature extraction. For
each image region, they perform automatic feature extraction from three nested and
increasingly larger rectangular windows, which are respectively the bounding box of the
considered region, the bounding box of its immediate neighboring regions, and the entire
image. The features extracted from the three CNNs are fed into the two fully connected
layers, each of which has 300 neurons. The output of the second fully connected layer is
fed into the output layer, which performs two-way softmax that produces a distribution
over binary saliency labels. When generating a saliency map for an input image, we run our
trained saliency model repeatedly over every region of the image to produce a single
saliency score for that region. This saliency score is further transferred to all pixels
within that region.
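
The inference head described above can be sketched in a few lines of NumPy. This is only an illustration of the layer shapes given in the text (three 4096-dimensional features, two hidden layers of 300 neurons, a two-way softmax); the ReLU activation, the random initialisation, and the names `init_head` and `saliency_scores` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def init_head(rng, in_dim=3 * 4096, hidden=300, n_classes=2):
    """Randomly initialise two hidden layers and the output layer (illustrative only)."""
    dims = [in_dim, hidden, hidden, n_classes]
    return [(rng.normal(0.0, 0.01, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def saliency_scores(features, params):
    """Forward pass: two hidden layers followed by a two-way softmax.

    features: (n_regions, 3 * 4096) concatenated CNN features, one row per region.
    Returns the probability of the 'salient' label for each region.
    """
    h = features
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)            # ReLU is an assumption
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]
```

Each returned score would then be copied to every pixel of the corresponding region to form the saliency map.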

2.1. Multiscale Feature Extraction 

We extract multiscale features for each image region with a deep convolutional neural
network originally trained over the ImageNet dataset [10] using Caffe [21], an open source
framework for CNN training and testing. The architecture of this CNN has eight layers
including five convolutional layers and three fully connected layers. Features are
extracted from the output of the second last fully connected layer, which has 4096
neurons. Although this CNN was originally trained on a dataset for visual recognition,
automatically extracted CNN features turn out to be highly versatile and can be more
effective than traditional handcrafted features on other visual computing tasks.

Since an image region may have an irregular shape while CNN features have to be extracted
from a rectangular region, to make the CNN features only relevant to the pixels inside the
region, as in [1. ], we define the rectangular region for CNN feature extraction to be the
bounding box of the image region and fill the pixels outside the region but still inside
its bounding box with the mean pixel values at the same locations across all ImageNet
training images. These pixel values become zero after mean subtraction and do not have any
impact on subsequent results. We warp the region in the bounding box to a square with
227x227 pixels to make it compatible with the deep CNN trained for ImageNet. The warped
RGB image region is then fed to the deep CNN and a 4096-dimensional feature vector is
obtained by forward propagating a mean-subtracted input image region through all the
convolutional layers and fully connected layers. We name this vector feature A.

Feature A itself does not include any information around the considered image region, and
thus cannot tell whether the region is salient with respect to its neighborhood or the
rest of the image. To include features from an area surrounding the considered region for
understanding the amount of contrast in its neighborhood, we extract a second feature
vector from a rectangular neighborhood, which is the bounding box of the considered region
and its immediate neighboring regions. All the pixel values in this bounding box remain
intact. Again, this rectangular neighborhood is fed to the deep CNN after being warped. We
call the resulting vector from the CNN feature B.

As we know, a very important cue in saliency computation is the degree of (color and
content) uniqueness of a region with respect to the rest of the image. The position of an
image region in the entire image is another crucial cue. To meet these demands, we use the
deep CNN to extract feature C from the entire rectangular image, where the considered
region is masked with mean pixel values for indicating the position of the region. These
three feature vectors obtained at different scales together define the features we adopt
for saliency model training and testing. Since our final feature vector is the
concatenation of three CNN feature vectors, we call it S-3CNN.
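
To make the three input windows concrete, the following sketch builds the image crops from which features A, B and C would be computed. It is a simplified illustration: the function names are hypothetical, a single mean colour `mean_pixel` stands in for the per-location ImageNet mean image, and warping to 227x227 and the CNN forward pass are left out.

```python
import numpy as np

def bounding_box(mask):
    """Return (top, bottom, left, right) of the True pixels; bottom/right are exclusive."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

def nested_windows(image, labels, region_id, neighbor_ids, mean_pixel):
    """Build the three inputs used for features A, B and C.

    image: (H, W, 3) float array; labels: (H, W) integer region map;
    neighbor_ids: ids of the regions adjacent to `region_id`;
    mean_pixel: (3,) fill colour (simplifying assumption, see lead-in).
    """
    region = labels == region_id
    # Window A: bounding box of the region; pixels outside the region get the mean.
    t, b, l, r = bounding_box(region)
    win_a = np.where(region[t:b, l:r, None], image[t:b, l:r], mean_pixel)
    # Window B: bounding box of the region and its neighbours; pixels stay intact.
    t, b, l, r = bounding_box(region | np.isin(labels, neighbor_ids))
    win_b = image[t:b, l:r].copy()
    # Window C: the entire image with the region itself masked by the mean.
    win_c = np.where(region[..., None], mean_pixel, image)
    return win_a, win_b, win_c
```

Each window would then be warped to the CNN input size, and the three resulting 4096-dimensional vectors concatenated into S-3CNN.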

2.2. Neural Network Training 

On top of the multiscale CNN features, we train a neural network with one output layer and
two fully connected hidden layers. This network plays the role of a regressor that infers
the saliency score of every image region from the multiscale CNN features extracted for
the image region. It is well known that neural networks with fully connected hidden layers
can be trained to reach a very high level of regression accuracy.

Concatenated multiscale CNN features are fed into this network, which is trained using a
collection of training images and their labeled saliency maps, which have pixelwise binary
saliency scores. Before training, every training image is first decomposed into a set of
regions. The saliency label of every image region is further estimated using pixelwise
saliency labels. During the training stage, only those regions in which 70% or more of the
pixels share the same saliency label are chosen as training samples, and their saliency
labels are set to 1 or 0 accordingly. During training, the output layer and the fully
connected hidden layers together minimize the least-squares prediction errors accumulated
over all regions from all training images.
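
The 70% rule for choosing training samples can be sketched as follows. The function name and the dictionary output are illustrative choices; only the purity threshold comes from the text.

```python
import numpy as np

def region_training_labels(labels, gt_mask, min_purity=0.7):
    """Map region id -> 0/1 label for regions that are at least `min_purity` pure.

    labels: (H, W) integer region map; gt_mask: (H, W) binary ground truth.
    Regions whose pixels are too mixed are left out of the training set.
    """
    samples = {}
    for region_id in np.unique(labels):
        salient_fraction = gt_mask[labels == region_id].mean()
        if salient_fraction >= min_purity:
            samples[int(region_id)] = 1
        elif 1.0 - salient_fraction >= min_purity:
            samples[int(region_id)] = 0
        # otherwise the region is ambiguous and is skipped
    return samples
```

For instance, a region that is half salient and half background contributes no training sample under this rule.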

Note that the output of the penultimate layer of our neural network is indeed a fine-tuned
feature vector for saliency detection. Traditional regression techniques, such as support
vector regression and random forests, can be further trained on this feature vector to
generate a saliency score for every image region. In our experiments, we found that this
feature vector is very discriminative and the simple logistic regression embedded in the
final layer of our architecture is strong enough to generate state-of-the-art performance
on all visual saliency datasets.

3. The Complete Algorithm 

3.1. Multi-Level Region Decomposition 

A variety of methods can be applied to decompose an image into nonoverlapping regions.
Examples include grids, region growing, and pixel clustering. Hierarchical image
segmentation can generate regions at multiple scales to support the intuition that a
semantic object at a coarser scale may be composed of multiple parts at a finer scale. To
enable a fair comparison with previous work on saliency estimation, we follow the
multi-level region decomposition pipeline in [22]. Specifically, for an image $I$, $M$
levels of image segmentations, $S = \{S_1, S_2, \ldots, S_M\}$, are constructed from the
finest to the coarsest scale. The regions at any level form a nonoverlapping
decomposition. The hierarchical region merge algorithm in [3] is applied to build a
segmentation tree for the image. The regions in the initial set are called superpixels.
They are generated using the graph-based segmentation algorithm in [13]. Region merge is
prioritized by the edge strength at the boundary pixels shared by two adjacent regions.
Regions with lower edge strength between them are merged earlier. The edge strength at a
pixel is determined by a real-valued ultrametric contour map (UCM). In our experiments, we
normalize the value of UCM into [0,1] and generate 15 levels of segmentations with
different edge strength thresholds. The edge strength threshold for level $i$ is adjusted
such that the number of regions reaches a predefined target. The target numbers of regions
at the finest and coarsest levels are set to 300 and 20 respectively, and the number of
regions at intermediate levels follows a geometric series.
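
As a small worked example, the target region counts for the 15 levels can be generated as a geometric series between 300 and 20. Rounding to the nearest integer is an assumption made here; the text does not say how fractional targets are handled.

```python
import numpy as np

def target_region_counts(finest=300, coarsest=20, levels=15):
    """Geometrically spaced region counts from the finest to the coarsest level."""
    return np.rint(np.geomspace(finest, coarsest, num=levels)).astype(int)
```

With these defaults, consecutive levels shrink by a constant ratio of (20/300)^(1/14), roughly 0.82.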

3.2. Spatial Coherence 

Given a region decomposition of an image, we can generate an initial saliency map with the
neural network model presented in the previous section. However, because image
segmentation is imperfect and our model assigns saliency scores to individual regions,
noisy scores inevitably appear in the resulting saliency map. To enhance spatial
coherence, a superpixel based saliency refinement method is used. The saliency score of a
superpixel is set to the mean saliency score over all pixels in the superpixel. The
refined saliency map is obtained by minimizing the following cost function, which can be
reduced to solving a linear system,

$$\sum_{i} \left(a_i^R - a_i^I\right)^2 + \sum_{i,j} w_{ij} \left(a_i^R - a_j^R\right)^2, \tag{1}$$

where $a_i^I$ is the initial saliency score at superpixel $i$, and $a_i^R$ is the refined
saliency score at the same superpixel. The first term in (1) encourages similarity between
the refined saliency map and the initial saliency map, while the second term is an
all-pair spatial coherence term that favors consistent saliency scores across different
superpixels if there do not exist strong edges separating them. $w_{ij}$ is the spatial
coherence weight between any pair of superpixels $P_i$ and $P_j$.

To define the pairwise weights $w_{ij}$, we construct an undirected weighted graph on the
set of superpixels. There is an edge in the graph between any pair of adjacent superpixels
$(P_i, P_j)$, and the distance between them is defined as follows,

$$d(P_i, P_j) = \frac{\sum_{p \in (\Omega_{P_i} \cap P_j) \cup (P_i \cap \Omega_{P_j})} ES(p)}{\left| (\Omega_{P_i} \cap P_j) \cup (P_i \cap \Omega_{P_j}) \right|}, \tag{2}$$

where $ES(p)$ is the edge strength at pixel $p$ and $\Omega_P$ represents the set of
pixels on the outside boundary of superpixel $P$. We again make use of the UCM proposed in
[3] to define edge strength here. The distance between any pair of non-adjacent
superpixels is defined as the shortest path distance in the graph. The spatial coherence
weight is thus defined as $w_{ij} = \exp\left(-\frac{d^2(P_i, P_j)}{2\sigma^2}\right)$,
where $\sigma$ is set to the standard deviation of pairwise distances in our experiments.
This weight is large when two superpixels are located in the same homogeneous region and
small when they are separated by strong edges.
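
The refinement step can be sketched as below, under two stated assumptions: the cost is the quadratic form "sum over i of (r_i - a_i)^2 plus sum over ordered pairs (i, j) of w_ij (r_i - r_j)^2", and the weight is a Gaussian of the shortest-path distance. Setting the gradient to zero then gives the linear system (I + 2L) r = a, where L is the graph Laplacian of the weights. The function names are hypothetical, and the sketch assumes the superpixel graph is connected.

```python
import numpy as np

def coherence_weights(adjacent_dist):
    """All-pair coherence weights from distances between adjacent superpixels.

    adjacent_dist: (n, n) symmetric array, np.inf for non-adjacent pairs.
    """
    d = adjacent_dist.astype(float).copy()
    np.fill_diagonal(d, 0.0)
    for k in range(len(d)):                        # Floyd-Warshall shortest paths
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    pair_dists = d[np.triu_indices(len(d), k=1)]
    sigma = pair_dists.std()                       # std of pairwise distances
    w = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)                       # no self-coherence term
    return w

def refine(initial, w):
    """Minimise sum_i (r_i - a_i)^2 + sum_{i,j} w_ij (r_i - r_j)^2 over r."""
    laplacian = np.diag(w.sum(axis=1)) - w
    # Zero gradient: (I + 2 L) r = a for symmetric w and ordered pairs (i, j).
    return np.linalg.solve(np.eye(len(initial)) + 2.0 * laplacian, initial)
```

Because the Laplacian has zero column sums, this system preserves the total saliency while pulling scores of strongly connected superpixels towards each other.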

3.3. Saliency Map Fusion 

We apply both our neural network model and spatial coherence refinement to each of the $M$
levels of segmentation. As a result, we obtain $M$ refined saliency maps,
$\{A^{(1)}, A^{(2)}, \ldots, A^{(M)}\}$, interpreting salient parts of the input image at
various granularity. We aim to further fuse them together to obtain a final aggregated
saliency map. To this end, we take a simple approach by assuming the final saliency map is
a linear combination of the maps at individual segmentation levels, and learn the weights
in the linear combination by running a least-squares estimator over a validation dataset,
indexed with $I_v$. Thus, our aggregated saliency map $A$ is formulated as follows,

$$A = \sum_{k=1}^{M} \alpha_k A^{(k)}, \quad \text{s.t. } \{\alpha_k\}_{k=1}^{M} = \arg\min_{\alpha_1, \alpha_2, \ldots, \alpha_M} \sum_{i \in I_v} \left\| G_i - \sum_{k} \alpha_k A_i^{(k)} \right\|_F^2, \tag{3}$$

where $G_i$ is the ground truth saliency map of the $i$-th validation image and
$A_i^{(k)}$ is its refined saliency map at segmentation level $k$.


Note that there are many options for saliency fusion. For example, a conditional random
field (CRF) framework has been adopted in [27] to aggregate multiple saliency maps from
different methods. Nevertheless, we have found that, in our context, a linear combination
of all saliency maps can already serve our purposes well and is capable of producing
aggregated maps with a quality comparable to those obtained from more complicated
techniques.
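
The linear fusion can be sketched with an ordinary least-squares solve. The names and the array layout are assumptions for the example; the only thing taken from the text is that one weight per segmentation level is fitted on validation images against their ground truth.

```python
import numpy as np

def learn_fusion_weights(level_maps, ground_truths):
    """Least-squares weights for combining M per-level saliency maps.

    level_maps: list over validation images of (M, H, W) arrays.
    ground_truths: list of matching (H, W) ground truth maps.
    """
    # One row per pixel, one column per segmentation level.
    X = np.concatenate([m.reshape(m.shape[0], -1).T for m in level_maps])
    y = np.concatenate([g.ravel() for g in ground_truths]).astype(float)
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha

def fuse(maps, alpha):
    """Weighted sum of the (M, H, W) maps of one image."""
    return np.tensordot(alpha, maps, axes=1)
```

At test time, `fuse` is applied to the M refined maps of a new image with the weights learned once on the validation set.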


4. A New Dataset 

At present, the pixelwise ground truth annotation [22] of the MSRA-B dataset [26] is the
only large dataset that is suitable for training a deep neural network. Nevertheless, this
benchmark becomes less challenging once a center prior and a boundary prior [22, 40] have
been imposed since most images in the dataset contain only one connected salient region
and 98% of the pixels in the border area belong to the background [22].

We have constructed a more challenging dataset to facilitate the research and evaluation
of visual saliency models. To build the dataset, we initially collected 7320 images. These
images were chosen because they satisfy at least one of the following criteria:

1. there are multiple disconnected salient objects;

2. at least one of the salient objects touches the image boundary;

3. the color contrast (the minimum Chi-square distance between the color histograms of
any salient object and its surrounding regions) is less than 0.7.

To reduce label inconsistency, we asked three people to annotate salient objects in all
7320 images individually using a custom designed interactive segmentation tool. On
average, each person took 1-2 minutes to annotate one image. The annotation stage spanned
three months.

Let $A_p = \{a_x^{(p)}\}$ be the binary saliency mask labeled by the $p$-th user, where
$a_x^{(p)} = 1$ if pixel $x$ is labeled as salient and $a_x^{(p)} = 0$ otherwise. We
define label consistency as the ratio between the number of pixels labeled as salient by
all three people and the number of pixels labeled as salient by at least one of the
people. It is formulated as

$$C = \frac{\sum_{x} \prod_{p=1}^{3} a_x^{(p)}}{\sum_{x} \mathbf{1}\left(\sum_{p=1}^{3} a_x^{(p)} \neq 0\right)}. \tag{4}$$

We excluded those images with label consistency $C < 0.9$, and 4447 images remained. For
each image that passed the label consistency test, we generated a ground truth saliency
map from the annotations of three people. The pixelwise saliency label in the ground truth
saliency map, $G = \{g_x \mid g_x \in \{0, 1\}\}$, is determined according to the majority
label among the three people as follows,

$$g_x = \begin{cases} 1 & \text{if } \sum_{p=1}^{3} a_x^{(p)} \geq 2, \\ 0 & \text{otherwise}. \end{cases} \tag{5}$$

At the end, our new saliency dataset, called HKU-IS, contains 4447 images with
high-quality pixelwise annotations. All the images in HKU-IS satisfy at least one of the
above three criteria while 2888 (out of 5000) images in the MSRA-B dataset do not satisfy
any of these criteria. In summary, 50.34% of the images in HKU-IS have multiple
disconnected salient objects while this number is only 6.24% for the MSRA-B dataset; 21%
of the images in HKU-IS have salient objects touching the image boundary while this number
is 13% for the MSRA-B dataset; and the mean color contrast of HKU-IS is 0.69 while that of
the MSRA-B dataset is 0.78.


5. Experimental Results

5.1. Dataset

We have evaluated the performance of our method on several public visual saliency
benchmarks as well as on our own dataset.

MSRA-B [26]. This dataset has 5000 images, and is widely used for visual saliency
estimation. Most of the images contain only one salient object. Pixelwise annotation was
provided by [22].


SED [2]. It contains two subsets: SED1 and SED2. SED1 has 100 images each containing only
one salient object while SED2 has 100 images each containing two salient objects.

SOD [28]. This dataset has 300 images, and it was originally designed for image
segmentation. Pixelwise annotation of salient objects in this dataset was generated by
[2^ ]. This dataset is very challenging since many images contain multiple salient objects
either with low contrast or overlapping with the image boundary.

Figure 2: Visual comparison of saliency maps generated from 10 different methods, including ours (MDF). The ground truth
(GT) is shown in the last column. MDF consistently produces saliency maps closest to the ground truth. We compare MDF 
against spectral residual (SR[18]), frequency-tuned saliency (FT [1]), saliency filters (SF [29]), geodesic saliency (GS [35]), 
hierarchical saliency (HS [37]), regional based contrast (RC [8]), manifold ranking (MR [38]), optimized weighted contrast 
(wCtr* [40]) and discriminative regional feature integration (DRFI [22]). 


iCoSeg [5]. This dataset was designed for co-segmentation. It contains 643 images with
pixelwise annotation. Each image may contain one or multiple salient objects.

HKU-IS. Our new dataset contains 4447 images with pixelwise annotation of salient objects.
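
For reference, the label consistency test and the majority vote used to build the HKU-IS ground truth (Section 4) can be sketched as follows. The function names are illustrative, and returning a consistency of 1.0 when no annotator marks any pixel is a convention assumed here for the empty case.

```python
import numpy as np

def label_consistency(masks):
    """Ratio of pixels marked salient by all annotators to those marked by any.

    masks: (P, H, W) binary annotations, one slice per annotator.
    """
    masks = np.asarray(masks).astype(bool)
    unanimous = masks.all(axis=0).sum()
    any_salient = masks.any(axis=0).sum()
    return unanimous / any_salient if any_salient else 1.0

def majority_ground_truth(masks):
    """Pixelwise majority label among the annotators."""
    masks = np.asarray(masks).astype(int)
    return (masks.sum(axis=0) * 2 > masks.shape[0]).astype(np.uint8)
```

An image would be kept only if `label_consistency` is at least 0.9, and its ground truth is then given by `majority_ground_truth`.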

To facilitate a fair comparison with other methods, we divided the MSRA-B dataset into
three parts as in [22], 2500 for training, 500 for validation and the remaining 2000
images for testing. Since other existing datasets are too small to train reliable models,
we directly applied a trained model to generate their saliency maps as in [2/ ]. We also
divided HKU-IS into three parts, 2500 images for training, 500 images for validation and
the remaining 1447 images for testing. The images for training and validation were
randomly chosen from the entire dataset.

While it takes around 20 hours to train our deep neural network based prediction model for
15 image segmentation levels using the MSRA-B dataset, it only takes 8 seconds to detect
salient objects in a testing image with 400x300 pixels on a PC with an NVIDIA GTX Titan
Black GPU and a 3.4GHz Intel processor using our MATLAB code.

5.2. Evaluation Criteria 

Following [1, 8], we first use standard precision-recall curves to evaluate the
performance of our method. A continuous saliency map can be converted into a binary mask
using a threshold, resulting in a pair of precision and recall values when the binary mask
is compared against the ground truth. A precision-recall curve is then obtained by varying
the threshold from 0 to 1. The curves are averaged over each dataset.
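
A minimal sketch of this procedure for a single image is given below. The number of thresholds and the convention of reporting precision 1.0 when no pixel is predicted salient are assumptions made for the example.

```python
import numpy as np

def precision_recall_curve(saliency, gt, num_thresholds=256):
    """Precision and recall of the thresholded map for thresholds in [0, 1].

    saliency: (H, W) array with values in [0, 1]; gt: (H, W) binary mask.
    """
    gt = gt.astype(bool)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    precision, recall = [], []
    for t in thresholds:
        pred = saliency >= t
        tp = np.logical_and(pred, gt).sum()
        precision.append(tp / pred.sum() if pred.sum() else 1.0)
        recall.append(tp / gt.sum() if gt.sum() else 0.0)
    return thresholds, np.array(precision), np.array(recall)
```

Averaging the per-image precision and recall arrays over a dataset gives one curve per method.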

Second, since high precision and high recall are both desired in many applications, we
compute the F-Measure [1] as

$$F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}, \tag{6}$$

where $\beta^2$ is set to 0.3 to weigh precision more than recall as suggested in [ ]. We
report the performance when each saliency map is binarized with an image-dependent
threshold proposed by [1]. This adaptive threshold is determined to be twice the mean
saliency of the image:

$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y), \tag{7}$$

where $W$ and $H$ are the width and height of the saliency map $S$, and $S(x, y)$ is the
saliency value of the pixel at $(x, y)$. We report the average precision, recall and
F-measure over each dataset.
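
The adaptive-threshold evaluation can be sketched as follows for a single image. The function name is hypothetical, and the handling of empty predictions or empty ground truth (division guarded by a minimum of 1) is an assumption for the example.

```python
import numpy as np

def adaptive_f_measure(saliency, gt, beta2=0.3):
    """Precision, recall and F-measure at twice the mean saliency of the map."""
    threshold = 2.0 * saliency.mean()
    pred = saliency >= threshold
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    f = (1.0 + beta2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f
```

For a 2x2 map with values 1.0, 0.5, 0, 0 and ground truth covering the first two pixels, the threshold is 0.75, so precision is 1.0, recall is 0.5 and the F-measure is 0.65 / 0.8 = 0.8125.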

Although commonly used, precision-recall curves have limited value because they fail to
consider true negative pixels. For a more balanced comparison, we adopt the mean absolute
error (MAE) as another evaluation criterion. It is defined as the average pixelwise
absolute difference between the binary ground truth $G$ and the saliency map $S$ [29],

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|. \tag{8}$$

Figure 3: Quantitative comparison of saliency maps generated from 10 different methods on 4 datasets. From left to right:
(a) the MSRA-B dataset, (b) the SOD dataset, (c) the iCoSeg dataset, and (d) our new HKU-IS dataset. From top to bottom: 
(1st row) the precision-recall curves of different methods, (2nd row) the precision, recall and F-measure using an adaptive 
threshold, and (3rd row) the mean absolute error. 


MAE measures the numerical distance between the ground truth and the estimated saliency
map, and is more meaningful in evaluating the applicability of a saliency model in a task
such as object segmentation.
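
For completeness, the MAE of one saliency map is a one-line computation; the function name below is illustrative.

```python
import numpy as np

def mean_absolute_error(saliency, gt):
    """Average pixelwise absolute difference between a saliency map and binary ground truth."""
    return float(np.abs(saliency.astype(float) - gt.astype(float)).mean())
```

Unlike precision and recall, this measure also rewards pixels correctly left at low saliency, since every pixel contributes to the average.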

5.3. Comparison with the State of the Art 

We compare our saliency model (MDF) with a number of existing state-of-the-art methods,
including discriminative regional feature integration (DRFI) [22], optimized weighted
contrast (wCtr*) [40], manifold ranking (MR) [38], regional based contrast (RC) [8],
hierarchical saliency (HS) [37], geodesic saliency (GS) [35], saliency filters (SF) [29],
frequency-tuned saliency (FT) [1] and the spectral residual approach (SR) [18]. For RC, FT
and SR, we use the implementation provided by [8]; for other methods, we use original
codes with recommended parameter settings.

A visual comparison is given in Fig. 2. As can be seen, our method performs well in a
variety of challenging cases, e.g., multiple disconnected salient objects (the first two
rows), objects touching the image boundary (the second row), cluttered background (the
third and fourth rows), and low contrast between object and background (the last two
rows).

As part of the quantitative evaluation, we first evaluate our method using
precision-recall curves. As shown in the first row of Fig. 3, our method achieves the
highest precision in almost the entire recall range on all datasets. Precision, recall and
F-measure results using the aforementioned adaptive threshold are shown in the second row
of Fig. 3, sorted by the F-measure. Our method also achieves the best performance on the
overall F-measure as well as significant increases in both precision and recall. On the
MSRA-B dataset, our method achieves 86.4% precision and 87.0% recall while the second best
(MR) achieves 84.8% precision and 76.3% recall. Performance improvement becomes more
obvious on HKU-IS. Compared with the second best (DRFI), our method increases the
F-measure from 0.71 to 0.80, and achieves an increase of 9% in precision while at the same
time improving the recall by 5.7%. Similar conclusions can also be drawn on other
datasets. Note that the precision of certain methods, including MR [38], DRFI [22],
HS [37] and wCtr* [40], is comparable to ours while their recalls are often much lower.
Thus it is more likely for them to miss salient pixels. This is also reflected in the
lower F-measure and higher MAE. Refer to the supplemental materials for the results on the
SED dataset.

Figure 4: Component-wise efficacy in our visual saliency model. (a) and (b) show the effectiveness of our S-3CNN feature:
(a) shows the precision-recall curves of models trained on MSRA-B using different components of S-3CNN, while (b) shows
the corresponding precision, recall and F-measure using an adaptive threshold. (c) and (d) show the effectiveness of spatial
coherence and multilevel fusion. refers to models with spatial coherence. “Layer1”, “Layer2” and “Layer3” refer to the
three segmentation levels that have the highest single-level saliency prediction performance.

The third row of Fig. 3 shows that our method also significantly outperforms other
existing methods in terms of the MAE measure, which provides a better estimation of the
visual distance between the predicted saliency map and the ground truth. Our method
successfully lowers the MAE by 5.7% with respect to the second best algorithm (wCtr*) on
the MSRA-B dataset. On two other datasets, iCoSeg and SOD, our method lowers the MAE by
26.3% and 17.1% respectively with respect to the second best algorithms. On HKU-IS, which
contains more challenging images, our method significantly lowers the MAE by 35.1% with
respect to the second best performer on this dataset (wCtr*).

In summary, the improvement our method achieves over the state of the art is substantial.
Furthermore, the more challenging the dataset, the more obvious the advantages because our
multiscale CNN features are capable of characterizing the contrast relationship among
different parts of an image.

5.4. Component-wise Efficacy 

Effectiveness of S-3CNN. As discussed in Section 2.1, our multiscale CNN feature vector,
S-3CNN, consists of three components, A, B and C. To show the effectiveness and necessity
of these three parts, we have trained five additional models for comparison, which
respectively take feature A only, feature B only, feature C only, concatenated A and B,
and concatenated A and C. These five models were trained on MSRA-B using the same setting
as the one taking S-3CNN. Quantitative results were obtained on the testing images in the
MSRA-B dataset. As shown in Fig. 4, the model trained using S-3CNN consistently achieves
the best performance on average precision, recall and F-measure. Models trained using two
components perform much better than those trained using a single component. These results
demonstrate that the three components of our multiscale CNN feature vector are
complementary to each other, and the training stage of our saliency model is capable of
discovering and understanding region contrast information hidden in our multiscale
features.

Spatial Coherence. In Section 3.2, spatial coherence was incorporated to refine the
saliency scores from our CNN-based model. To validate its effectiveness, we have evaluated
the performance of our final saliency model with and without spatial coherence using the
testing images in the MSRA-B dataset. We further chose the three segmentation levels that
have the highest single-level saliency prediction performance, and compared their
performance with spatial coherence turned on and off. The resulting precision-recall
curves are shown in Fig. 4. It is evident that spatial coherence clearly improves the
accuracy of our models.

Multilevel Decomposition. Our method exploits information from multiple levels of image
segmentation. As shown in Fig. 4, the performance of a single segmentation level is not
comparable to the performance of the fused model. The aggregated saliency map from 15
levels of image segmentation improves the average precision by 2.15% and at the same time
improves the recall rate by 3.47% when it is compared with the result from the
best-performing single level.